OPTIMAL DESIGN AND USE OF RETRY IN FAULT TOLERANT REAL-TIME COMPUTER SYSTEMS

Yann-Hang Lee and Kang G. Shin

ABSTRACT 


In this paper, we present a new method for (i) determining an optimal retry policy
and (ii) using retry for fault characterization.



First, we derive an optimal retry policy for a given fault characteristic, which determines
the maximum allowable retry durations so as to minimize the total task completion time.
Then, we carry out the combined fault characterization and retry decision, in which the
characteristics of a fault are estimated simultaneously with the determination of the
optimal retry policy. We have developed two solution approaches: one is based on point
estimation and the other on the Bayes sequential decision. Maximum likelihood estimators
are used in the first approach, and backward induction for testing hypotheses in the
second.

We also present numerical examples in which all the durations associated with faults
(i.e., active, benign, and inter-failure durations) have monotone hazard rate functions,
e.g., exponential, Weibull, and gamma distributions. These are standard distributions
commonly used for the modeling and analysis of faults.


Categories and Subject Descriptors: B.2.3 [Arithmetic and Logic Structures]: Reliability,
Testing and Fault-Tolerance -- hazard rate function, recovery overhead, optimal retry
policy, fault characteristic; G.3 [Probability and Statistics] -- estimation, censored
sampling, likelihood ratio, sequential or Bayes decision problem, hypotheses testing.

1. INTRODUCTION

There are three types of fault in computer systems: transient, intermittent, and
permanent [1]. Transient faults die within a certain time of their generation,
intermittent faults cycle between being active and inactive, and permanent faults are (as
the term indicates) permanent. It has been found that permanent faults form but a small
fraction of the faults in computer systems [2,3]. This makes the purging of any faulty
components as soon as they have been discovered an inefficient means for handling
redundancy. If the active duration of a transient or intermittent fault is short,
continuing the task on the same resource after the fault disappears may be more efficient
than using other recovery methods. Unfortunately, it is impossible to tell at its first
occurrence whether or not the fault is permanent, and also impossible to know its active
duration if the fault is intermittent or transient. Moreover, it would be much more
efficient (time-wise) and accurate to characterize faults on-line, and then take the
appropriate recovery actions. In this paper, we propose (i) determining an optimal retry
policy so as to minimize the task completion time, and (ii) using retry in conjunction
with statistical estimation and decision theory to characterize faults. We obtain the
optimal retry duration in the face of uncertainty about the nature of a detected fault.
Since our focus is on real-time systems, we are principally concerned with skewing the
density function of the task completion time as much to the left as possible. For this
reason, we shall concentrate on maximizing reductions in response time.

As the term implies, retry consists of restoring the affected process to some fault-free
initial state, and then re-running it on the same processor. Clearly, retry is only
applicable when the error induced by a fault is confined and the process can be restored
to integrity. The most efficient means for fault confinement are signal-level detection
mechanisms, which can detect faults immediately upon their occurrence [4]. For the process
to be restorable to integrity, scratchpad registers are needed. Results obtained by
Carter, et al. [5] indicate that self-checking and retry mechanisms can be incorporated
into processors inexpensively, and without substantially degrading performance.

Currently, several commercial machines incorporate retry. In the Honeywell 6000 [6],
instruction retry is reported to approach an effectiveness rate of 100%. Retry in the
IBM 360 and 370 series machines is widely used in the peripheral areas (I/O and storage)
as well as in the central processor [7]. The UNIVAC 1100/60 uses a hardware timer that
goes off after an interval judged to be long enough to allow transient faults to die out,
upon which retry can be effected [8]. However, no discussion or justification of the
retry duration or the number of retry attempts used has been given.

The usefulness of retry mechanisms arises, as we said above, from (i) the small
proportion of permanent faults in any computer system, and (ii) the fast recovery from
non-permanent faults and thus the small task completion time. In the case of a permanent
fault, to retry a process on the affected processor is worse than useless: it is a waste
of time. To hasten the completion of the executing task when a fault is detected, we must
control the duration of retry to maximize the difference between the expected gain in
response time that results from using retry when the fault is transient or intermittent
(in some cases), and the expected loss that results from using it when the fault is
permanent or intermittent (in other cases). Our object in this paper is to derive the
maximum allowable retry duration $r^*$ when a fault is detected. If the retry succeeds
within this duration, the execution continues. If not, other methods for error recovery,
e.g., rollback or restart following the system reconfiguration, must be used.
In addition to the performance gain in case of a successful retry, the characteristic of
a fault can be monitored through retries. For instance, a retry which succeeds after the
retry duration $r'$ implies that the active duration of the fault is also less than or
equal to $r'$.^2 Even when the retry fails, it indicates that the active duration of the
fault is greater than $r'$. On the other hand, the detection of a fault gives information
regarding the duration between fault occurrences and the benign duration of an
intermittent fault. Thus, it becomes possible to observe the nature of a fault through
both retry and detection mechanisms. Note, however, that the information obtained from
retry is censored since, for example, in case of an unsuccessful retry, the sampling via
retry is stopped while the associated fault is still active. With the censored
information, the problem of estimating the nature of a fault is the same as the design of
experiments in sequential analysis, where the experiments are described by the retry
policy, and the sampling is analogous to detection and retry.

The paper first presents a brief description of fault models in Section 2. It should be
obvious that $r^*$ will depend on fault behavior, and in Section 3 we begin with how to
derive it, given the fault characteristics. Quantitative descriptions of fault behavior
are, however, hard to come by in real environments. We counter this in Section 4 by
showing how to use statistical estimation theory to create a system that learns the fault
characteristics via retry as it goes, and therefore becomes increasingly more "optimal"
in the sense of minimizing the task completion time; the combination of retry and
detection enables us to observe the fault characteristics while determining the optimal
retry policy. Because an intermittent fault reappears repeatedly, retry of an
intermittent fault is a renewal process. In Section 5, we apply the Bayes sequential
decision to fault characterization and retry decision. The backward induction for testing
hypotheses is also presented as a solution to the sequential decision problem. The paper
concludes with Section 6.

^2 This is not strictly true because of fault latency. The fault latency [1], defined as
the interval between the moments of fault occurrence and error generation, has no effect
on retry. Thus, we simply ignore the fault latency in the consideration of retry.

In what follows, we will use continuous retry durations instead of the number of retry
attempts.^3

2. BASIC MODEL OF FAULTS 

Let a unit be the smallest hardware system component that the detection mechanisms can
distinguish when a fault is detected; e.g., a processor module can be viewed as an
assembly of such units. Hence, the term "module" will mean a system component larger than
a unit. Typically, a module is formed by a set of units. For each unit, the fault's
behavior can be modeled by a three-state stochastic process as in Figure 1. Denote these
three states, namely non-faulty, fault-active, and fault-benign, by NF, F, and FB (see [4]
for their detailed descriptions). At NF, no fault exists in the unit. A transition from NF
to F indicates the occurrence of a fault. If the fault is intermittent and becomes benign
following an active duration, the state of the unit changes to FB. The unit may move back
to F when this intermittent fault recurs; this is referred to as the reappearance of the
intermittent fault. If the fault is transient and disappears, the unit will transfer from
F back to NF. Models similar to this have been widely used in reliability analyses and
the modeling of faults [4,9,10].

Let $T_f$, $T_t^a$, $T_i^a$, and $T_i^b$ denote the duration between two successive fault
occurrences, the active duration of a transient fault, and the active and benign durations
of an intermittent fault, respectively. These durations are random variables with
distributions $F_f$, $F_t^a$, $F_i^a$, and $F_i^b$, and density functions $f_f$, $f_t^a$,
$f_i^a$, and $f_i^b$, respectively. For simplicity, we assume that these durations are
mutually independent and that the causes triggering different types of fault are not
correlated. The latter assumption implies that the occurrence of any type of fault can be
modeled as a Bernoulli process with probabilities $p_t$, $p_i$, and $p_p$ for transient,
intermittent, and permanent faults, respectively. Thus, the characteristics of a fault can
be represented by a 7-tuple

$$C_f \in \{(p_t, p_i, p_p, F_f, F_t^a, F_i^a, F_i^b) \mid p_t + p_i + p_p = 1\}.$$

^3 Conversion between a retry duration and its corresponding number of retry attempts is
not difficult, as discussed in the Conclusion.

Usually, the mean time between the occurrences of faults, $E[T_f]$, is much larger than
any other duration. Thus, it is reasonably accurate to assume that there is at most one
fault in a given unit at any moment. In addition, it is assumed that the reappearance of
an intermittent fault is never mistaken for the occurrence of a new fault within a
unit.^4 Following a successful retry, the detection mechanisms should be able to recognize
the type of fault in a unit by continuously monitoring normal operation. When the
detection mechanisms find the same unit failing again within a short period, the unit is
declared to have an intermittent fault. If the fault has disappeared for a long period, it
is regarded as a transient fault.
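
The fault characteristic described in this section can be held in a small record. The
sketch below is illustrative only: the class name `FaultCharacteristic`, its field names,
and `draw_fault_type` are assumptions introduced here, and each distribution is
represented by a hypothetical sampling function rather than by the distribution function
itself. The exponential samplers and their rates are arbitrary example values.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a distribution: a function that draws one duration.
Sampler = Callable[[random.Random], float]


@dataclass(frozen=True)
class FaultCharacteristic:
    """Illustrative container for the 7-tuple C_f (all names are assumptions)."""
    p_t: float                    # probability that a new fault is transient
    p_i: float                    # probability that a new fault is intermittent
    p_p: float                    # probability that a new fault is permanent
    inter_fault: Sampler          # draws T_f
    transient_active: Sampler     # draws T_t^a
    intermittent_active: Sampler  # draws T_i^a
    intermittent_benign: Sampler  # draws T_i^b

    def __post_init__(self):
        # The three fault-type probabilities must sum to one.
        if abs(self.p_t + self.p_i + self.p_p - 1.0) > 1e-9:
            raise ValueError("p_t + p_i + p_p must equal 1")

    def draw_fault_type(self, rng: random.Random) -> str:
        """Pick the type of a newly occurring fault."""
        u = rng.random()
        if u < self.p_t:
            return "transient"
        if u < self.p_t + self.p_i:
            return "intermittent"
        return "permanent"


def exponential(rate: float) -> Sampler:
    """Sampler for an exponential duration with the given rate."""
    return lambda rng: rng.expovariate(rate)


# Example values only; they are not taken from the paper.
c_f = FaultCharacteristic(0.6, 0.3, 0.1, exponential(0.001), exponential(20.0),
                          exponential(10.0), exponential(2.0))
```

Keeping the record immutable mirrors the assumption of Section 3 that $C_f$ is given and
fixed while a retry policy is derived.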

3. OPTIMAL RETRY POLICY FOR GIVEN $C_f$
3.1. General Problem Statement 

Once a fault is detected, it is necessary to take a proper sequence of actions such as
fault isolation, system reconfiguration, and recovery. For convenience, define the
recovery overhead as the total time required to resume normal system operation after the
detection of a fault; this is a system-oriented view. On the other hand, the occurrence
of a fault may delay the completion of the executing task; this is a task-oriented view.
These two views are equivalent when the fault is transient or permanent. For an
intermittent fault, since it appears and disappears repetitively, the accumulated
overhead of retry could become unacceptably large (eventually infinite). In view of this
fact, an intermittent fault has the same undesirable effects on computer performance as a
permanent fault when retry is used as the sole means of recovery. Consequently, as far as
the minimization of the expected recovery overhead (i.e., the system-oriented view) is
concerned, an intermittent fault can be regarded as a permanent fault, and hence retry
should not be used when the detection mechanisms again find the same fault that was
detected but became inactive during the last retry. Once intermittent faults are treated
just like permanent faults, the optimal retry policy of minimizing the total recovery
overhead becomes equivalent to a special retry policy (for transient faults) which
minimizes the task completion time. More on this will be discussed near the end of this
subsection.

^4 This is the very reason why the term "unit" is introduced here.

A most attractive gain from retry is the rescuing of the executing task, i.e., the
task-oriented view of retry. Suppose there is enough redundancy so that the system may be
reconfigured and the affected task may be migrated to other fault-free modules when a
module becomes faulty (due to one or more faulty units within the module). It is obvious
that no task should be started on any faulty or potentially faulty module (one having one
or more units with benign intermittent faults). Consider a practical case in which a
module (i) becomes faulty once and gets back to normal during execution of a task, and
(ii) never becomes faulty again before the task is completed. In such a case, it is
possible to avoid the overhead of migrating and restarting the task by means of a
successful retry, leading to a fast completion of the task. Even if the fault that
occurred was intermittent, retry is the best recovery method when its active duration is
short and its benign duration is long, insofar as the completion time of the executing
task is concerned.

Considering the task-oriented view of retry, we will derive an optimal retry policy for a
given fault characteristic, $C_f$, which minimizes the expected task completion time when
a fault is detected during the task execution. Such an optimal policy would be of
significant value to real-time applications where small task completion times are of
paramount importance.

Let $x_0$ denote the computation time initially needed to complete the task under a
fault-free condition. When a fault is detected, the amount of computation remaining to
complete the task, i.e., the residual computation, is denoted by $x$, where
$0 < x \le x_0$. For the $n$-th detection of the same fault, when the residual computation
$x$ and the characteristic $C_f$ are both given, the optimal retry policy should specify a
maximum allowable retry duration, $r_n^*(x, C_f)$. When $n = 1$, the detected fault may be
transient, intermittent, or permanent, since the fault type is unknown; but it is
intermittent if $n \ge 2$. Since the optimal retry policy is to minimize the time
necessary to complete the residual computation $x$ regardless of what has happened in the
past, we have the following lemma.

Lemma 1: $r_i^*(x, C_f) = r_j^*(x, C_f)$ for all $i, j \ge 2$.

Thus, we have to consider two maximum retry durations for two different cases: the case
when a new fault is detected, and the case when an old intermittent fault is detected
again. Let $R = \{(r_1(x, C_f), r_2(x, C_f)) \mid 0 < x \le x_0\}$ be a retry policy where
the maximum retry duration is $r_1(x, C_f)$ or $r_2(x, C_f)$ for the detection of a new
fault or an old intermittent fault, respectively, with the residual computation $x$ and
fault characteristic $C_f$. For notational simplicity, we shall use $r_i$, whenever
convenient, in the sequel, to represent $r_i(x, C_f)$, $i = 1, 2$. Also, denote the
expected times needed to complete the residual computation $x$ by $V_1(x, C_f, R)$,
$V_2(x, C_f, R)$, $V_3(x, C_f, R)$, and $V_4(x, C_f, R)$ when the system is in the
following situations: execution starts/resumes on a non-faulty module, a new fault is
detected, an old intermittent fault is detected again, and execution continues following a
successful retry for an intermittent fault, respectively. Based on transitions among these
situations, one can derive the following recursive equations:

$$V_1(x, C_f, R) = \{1 - F_f(x)\}\, x + \int_0^x \{t + V_2(x - t, C_f, R)\}\, dF_f(t) \qquad (1)$$

$$V_2(x, C_f, R) = p_t \int_0^{r_1} \{t + V_1(x, C_f, R)\}\, dF_t^a(t) + p_i \int_0^{r_1} \{t + V_4(x, C_f, R)\}\, dF_i^a(t)$$
$$\qquad + \{1 - p_t F_t^a(r_1) - p_i F_i^a(r_1)\}\{V_1(a(x), C_f, R) + r_1 + t_s\} \qquad (2)$$

$$V_3(x, C_f, R) = \{1 - F_i^a(r_2)\}\{V_1(a(x), C_f, R) + r_2 + t_s\} + \int_0^{r_2} \{t + V_4(x, C_f, R)\}\, dF_i^a(t) \qquad (3)$$

$$V_4(x, C_f, R) = \{1 - F_i^b(x)\}\, x + \int_0^x \{t + V_3(x - t, C_f, R)\}\, dF_i^b(t) \qquad (4)$$

where $a(x)$ is the residual computation needed when the system applies recovery methods
other than retry, e.g., $a(x) = x_0$ if restart is used following an unsuccessful retry,
and $t_s$ is the set-up time necessary for system reconfiguration and re-initialization.
The optimal retry policy, $R^* = \{(r_1^*(x, C_f), r_2^*(x, C_f)) \mid 0 < x \le x_0\}$,
should minimize both $V_2(x, C_f, R)$ and $V_3(x, C_f, R)$ for all $x$, since retry is
directly applicable only to the second and third situations. (As such, they are explicitly
dependent on $r_1$ and $r_2$.) Obviously, this policy also minimizes $V_1(x, C_f, R)$ and
$V_4(x, C_f, R)$.

Since the mean time between failures is usually much longer than the other durations,
$V_1(x, C_f, R)$ can be accurately approximated by $x$. In general, there are no closed
form solutions for $r_1^*(x, C_f)$ and $r_2^*(x, C_f)$. However, these optimal retry
durations can be calculated numerically without difficulty, as explained below. With the
initial condition $V_4(0, C_f, R) = 0$, $V_3(x, C_f, R)$ and $V_4(x, C_f, R)$ can be
calculated iteratively using Eqs. (3) and (4) for any given $R$. Thus, one can numerically
determine $r_2^*(x, C_f)$ so as to minimize $V_3(x, C_f, R)$; $V_4(x, C_f, R)$ is also
determined. Once $V_4(x, C_f, R)$ is known, one can easily compute $V_2(x, C_f, R)$ and
therefore $r_1^*(x, C_f)$.
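
The iteration just described can be sketched numerically. The following is a minimal
sketch, not the authors' program: it assumes exponential active and benign durations
(rates `mu` and `nu`), restart as the backup recovery ($a(x) = x_0$), the approximation
$V_1(x) \approx x$, and a coarse, hypothetical grid of candidate retry durations. All
function and parameter names are introduced here for illustration.

```python
import math


def v3_exponential(r2, v4_x, x0, ts, mu):
    """Eq. (3) for an exponential active duration (rate mu) with a(x) = x0.

    For this distribution the r2 terms cancel, so V3 is a mix of the cost of
    giving up at once (x0 + ts) and the cost of retrying until the fault
    turns benign (1/mu + V4), weighted by the survival probability at r2.
    """
    surv = math.exp(-mu * r2)  # 1 - F_i^a(r2); equals 0.0 when r2 is infinite
    return surv * (x0 + ts) + (1.0 - surv) * (1.0 / mu + v4_x)


def solve_v3_v4(x0, ts, mu, nu, n=400, r_grid=(0.0, 0.05, 0.2, 1.0, math.inf)):
    """Tabulate V3, V4 and the minimizing r2 on the grid x = k * x0 / n."""
    dx = x0 / n
    # Probability that the benign duration falls in ((j-1)dx, j*dx], j = 1..n.
    p = [math.exp(-nu * (j - 1) * dx) - math.exp(-nu * j * dx)
         for j in range(1, n + 1)]
    v3 = [0.0] * (n + 1)
    v4 = [0.0] * (n + 1)
    r2 = [0.0] * (n + 1)
    for k in range(n + 1):
        x = k * dx
        # Eq. (4): either the task finishes before the fault reappears, or the
        # fault reappears after about (j - 0.5)dx with roughly (k - j)dx left.
        total = math.exp(-nu * x) * x
        for j in range(1, k + 1):
            total += p[j - 1] * ((j - 0.5) * dx + v3[k - j])
        v4[k] = total  # k = 0 gives the initial condition V4(0) = 0
        r2[k] = min(r_grid, key=lambda r: v3_exponential(r, v4[k], x0, ts, mu))
        v3[k] = v3_exponential(r2[k], v4[k], x0, ts, mu)
    return v3, v4, r2


v3, v4, r2 = solve_v3_v4(x0=1.0, ts=0.1, mu=10.0, nu=2.0)
```

With these example values the sketch selects an unbounded retry at $x = 0.5$ and no retry
at $x = 1$. For other distributions, `v3_exponential` would have to be replaced by a
numerical evaluation of Eq. (3) over a finer grid of retry durations.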

When the recovery overhead in place of the task completion time is to be minimized,
$r_2^*(x, C_f) = 0$ for all $x \in (0, x_0]$. In this case, the recovery overhead can be
expressed as $V_2(x, C_f, R) - V_1(x, C_f, R)$, which is the time spent to restore the
system to its state immediately before the fault is detected. The optimal retry duration
$r_1^*(x, C_f)$ can be determined through Eqs. (1)-(4) just as we can compute that for
minimizing the task completion time. Consequently, we will in the sequel deal with the
task completion time only.


3.2. Fault Active Durations with Monotone Hazard Rate Functions

Since $T_i^a$ is a continuous random variable, one can assume that $f_i^a(t)$ is
continuous in $(0, \infty)$. The hazard rate function of the active duration of an
intermittent fault is defined by

$$h_i^a(t) = \frac{f_i^a(t)}{1 - F_i^a(t)}.$$

When the hazard rate function of the active duration of an intermittent fault is
monotonically increasing, constant, or monotonically decreasing, the optimal retry
duration $r_2^*$ exhibits interesting properties. These properties play a significant
role in determining the optimal retry policy, since the time durations associated with
faults are usually modeled to have monotone hazard rate functions. Typical distributions
with monotonically increasing hazard rate functions include the gamma and the Weibull
distributions with shape parameters greater than 1. When their shape parameters are less
than 1, they have monotonically decreasing hazard rate functions. The exponential
distribution has a constant hazard rate. Consider first the non-decreasing hazard rate
function, which leads to the following theorem.
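
As a quick numerical illustration of these monotonicity properties, the sketch below
evaluates the Weibull hazard rate for three shape parameters. The function names, the
unit scale parameter, and the sample points are assumptions chosen for illustration.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Hazard rate f(t) / (1 - F(t)) of a Weibull distribution.

    With f(t) = (shape/scale) (t/scale)^(shape-1) exp(-(t/scale)^shape) and
    1 - F(t) = exp(-(t/scale)^shape), the exponential factor cancels.
    """
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def is_monotone(values, increasing):
    """True if the sequence is non-decreasing (or non-increasing)."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


ts = [0.1 * k for k in range(1, 51)]
rising = [weibull_hazard(t, 2.0) for t in ts]    # shape > 1
falling = [weibull_hazard(t, 0.5) for t in ts]   # shape < 1
constant = [weibull_hazard(t, 1.0) for t in ts]  # shape = 1: exponential case
```

Shape 1 reduces the Weibull distribution to the exponential one, which is why `constant`
holds a single repeated value.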
Theorem 1: When $h_i^a(t)$ is monotonically non-decreasing in $t$, $r_2^* = 0$ or
$r_2^* = \infty$.

Proof: Differentiating Eq. (3) with respect to $r_2$, we obtain

$$\frac{\partial V_3(x, C_f, R)}{\partial r_2} = \{r_2 + V_4(x, C_f, R)\}\, f_i^a(r_2) + 1 - F_i^a(r_2) - f_i^a(r_2)\{V_1(a(x), C_f, R) + r_2 + t_s\}$$
$$\qquad = f_i^a(r_2)\left[V_4(x, C_f, R) + \frac{1}{h_i^a(r_2)} - \{a(x) + t_s\}\right] \qquad (5)$$

where $V_1(a(x), C_f, R)$ has been approximated by $a(x)$. Since $V_4(x, C_f, R)$ is
independent of the past and current retry durations $r_2(y, C_f)$ where $y \ge x$,
$V_4(x, C_f, R) + \frac{1}{h_i^a(r_2)} - \{a(x) + t_s\}$ is non-increasing in
$r_2(x, C_f)$. If there is an $r$ such that
$V_4(x, C_f, R) + \frac{1}{h_i^a(r)} - \{a(x) + t_s\} < 0$, then the first derivative of
$V_3(x, C_f, R)$ with respect to $r_2$ is negative for all $r_2 > r$. Thus,
$r_2^* = \infty$. If such an $r$ does not exist, the derivative is always non-negative,
implying that $V_3(x, C_f, R)$ is monotonically non-decreasing in $r_2$. This results in
$r_2^* = 0$. Q.E.D.


Following the definition of $r_2^*(x, C_f)$, $r_2^*(x, C_f) = 0$ implies that no retry be
attempted for reappearing intermittent faults, whereas $r_2^*(x, C_f) = \infty$ means that
the retry should be applied until the intermittent fault becomes benign.

Corollary 1: When $h_i^a(t)$ is monotonically non-decreasing in $t$, and if there exists
an $x_2^*$ such that $a(x_2^*) + t_s - x_2^* - H_i^b(x_2^*)\, E[T_i^a] = 0$, where
$H_i^b(x)$ is the renewal function [11] corresponding to the distribution $F_i^b(t)$, then
$r_2^*(x, C_f) = \infty$ if $x < x_2^*$ and $r_2^*(x, C_f) = 0$ otherwise.

Proof: From Theorem 1, $r_2^*(x, C_f)$ is either $\infty$ or 0. When
$r_2^*(x, C_f) = \infty$, there exists an $r$ such that Eq. (5) becomes negative. Since
$V_4(x, C_f, R)$ is a monotonically non-decreasing function of $x$, there also exists an
$r$ such that Eq. (5) becomes negative when the residual computation is less than $x$.
Thus, $r_2^*(y, C_f) = \infty$ for all $y < x$. Since the active and benign durations are
mutually independent, we have

$$V_4(x, C_f, R^*) = x + \{E[N(x)] - 1\}\, E[T_i^a]$$

where $N(x)$ is the number of reappearances of the intermittent fault during the residual
computation $x$, namely $N(x) = \inf\{n \mid \sum_{k \le n} T_{i,k}^b > x\}$, where
$T_{i,k}^b$ is the benign duration following the $k$-th occurrence of the intermittent
fault. The expected value of $N(x)$, $E[N(x)]$, is equivalent to the renewal function
$H_i^b(x)$ corresponding to the distribution $F_i^b$. Also, $r_2^*(x, C_f) = \infty$ if
and only if $V_3(x, C_f, R)|_{r_2 = \infty} < V_3(x, C_f, R)|_{r_2 = 0}$, i.e.,

$$V_3(x, C_f, R)|_{r_2 = \infty} = E[T_i^a] + V_4(x, C_f, R^*) < a(x) + t_s.$$

Taking equality in the above relation, we obtain $x_2^*$, and thus the Corollary is
proved. Q.E.D.

^5 Note that the probability of having a zero benign duration of an intermittent fault
should be zero, i.e., $\Pr(T_i^b = 0) = 0$. Otherwise, no useful computation can be done.

Theorem 1 can also be viewed as below using the concept of stochastic ordering between
two random variables. A random variable $X$ is said to be stochastically larger than
another random variable $Y$ if $\mathrm{Prob}(X > t) \ge \mathrm{Prob}(Y > t)$ for all $t$
[12]. Let $T_i^a(\mid r)$ be the remaining life of the intermittent fault after retry has
been applied for the duration $r$. When the hazard rate function is non-decreasing,
$T_i^a(\mid r)$ is stochastically larger than $T_i^a(\mid s)$ provided $r < s$. Thus, for
all $s > r$, if it is worth continuing retry beyond the retry duration $r$ (in the sense
of minimizing the task completion time), then we should continue the retry even after the
retry duration $s$. Consequently, the retry continues until the intermittent fault
disappears.
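
The ordering of remaining lives can be checked numerically. The sketch below is an
illustration under assumed values: it uses a Weibull active duration with unit scale, and
the function names are introduced here. It computes the survival function of the
remaining life, $\mathrm{Prob}(T - r > t \mid T > r)$, for two elapsed retry durations.

```python
import math


def weibull_survival(t, shape, scale=1.0):
    """1 - F(t) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def residual_survival(t, r, shape, scale=1.0):
    """P(T - r > t | T > r): survival of the remaining life after r."""
    return weibull_survival(r + t, shape, scale) / weibull_survival(r, shape, scale)


# Shape 2 gives an increasing hazard rate: the remaining life after a short
# retry (r = 0.2) is stochastically larger than after a long one (r = 1.0).
short_retry = [residual_survival(t, 0.2, 2.0) for t in (0.1, 0.5, 1.0)]
long_retry = [residual_survival(t, 1.0, 2.0) for t in (0.1, 0.5, 1.0)]
```

For shape 1 (the exponential case) the two survival curves coincide, which is the
memoryless property.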
Note that when the hazard rate function is non-decreasing, $x_2^*$ is determined by the
mean active duration and is independent of the shape of the distribution. $x_2^*$ could
become negative when $E[T_i^a]$ is large, that is, when intermittent faults have a long
active duration. In such a case, Corollary 1 implies that no retry be applied for
intermittent faults. On the other hand, if the set-up overhead $t_s$ is large, $x_2^*$
could be even larger than $x_0$, implying that retry be used as the sole means of
recovering from an intermittent fault.

When the hazard rate $h_i^a(t)$ is decreasing, the nice properties stated in both
Theorem 1 and Corollary 1 do not exist. However, there exists at most one root of Eq. (5)
that minimizes $V_3$. In such a case, since there is no closed form expression of
$V_4(x, C_f, R^*)$, we have to resort to (less elegant) numerical techniques for
determining both $r_2^*(x, C_f)$ and $r_1^*(x, C_f)$, as was previously mentioned.

Several numerical examples, in which restart is used as the sole means of backup
recovery, i.e., $a(x) = x_0$, are shown in Figures 2 to 4. In these figures, the durations
are normalized with respect to $x_0$, and the active duration of the intermittent fault is
assumed to have the gamma or Weibull distribution. Figure 2 presents the values of
$x_2^*$ when the shape parameters $\alpha$ of the gamma and Weibull distributions,
respectively, are greater than or equal to 1. Figures 3 and 4 show the optimal retry
duration $r_2^*(x, C_f)$; the solid lines are for $\alpha < 1$ and the dashed lines for
$\alpha = 1$. Note that for the gamma distribution $h_i^a(t)$ approaches $1/\beta$ as
$t \to \infty$, where $\beta$ is the scale parameter. Thus, it is possible for the
derivative of $V_3$ to be negative (i.e., Eq. (5) becomes negative), implying
$r_2^*(x, C_f) = \infty$. For the Weibull distribution with $\alpha < 1$, $r_2^*$ never
becomes $\infty$ since $h_i^a(\infty) = 0$.

Consider the case where $T_t^a$, $T_i^a$, and $T_i^b$ are all exponentially distributed
with the parameters $\tau$, $\mu$, and $\nu$ for the transient fault disappearance rate,
the intermittent fault disappearance rate, and the intermittent fault reappearance rate,
respectively. Since the renewal function $H_i^b(x)$ becomes $1 + \nu x$, from Corollary 1
we have $r_2^*(x, C_f) = \infty$ if $x < x_2^*$ and $r_2^*(x, C_f) = 0$ if
$x \ge x_2^*$, where $x_2^* = \frac{\mu}{\mu + \nu}\left(x_0 + t_s - \frac{1}{\mu}\right)$.
Since $V_2(x, C_f, R)$ implicitly depends on $r_2$ via $V_4$, we can express
$V_4(x, C_f, R^*)$ as:

$$V_4(x, C_f, R^*) = \begin{cases} \left(1 + \dfrac{\nu}{\mu}\right) x & \text{if } x < x_2^* \\[2ex] x_0 + t_s + \dfrac{1}{\nu} - e^{-\nu (x - x_2^*)} \left(\dfrac{1}{\mu} + \dfrac{1}{\nu}\right) & \text{if } x \ge x_2^* \end{cases} \qquad (6)$$

The derivative of $V_2(x, C_f, R)$ with respect to $r_1$ becomes:

$$\frac{\partial V_2(x, C_f, R)}{\partial r_1} = p_p + p_t e^{-\tau r_1} \{1 - (x_0 + t_s - x)\tau\} + p_i e^{-\mu r_1} [1 - \{x_0 + t_s - V_4(x, C_f, R)\}\mu] \qquad (7)$$

With $r_2^*(x, C_f)$ as determined in Corollary 1 and $V_4(x, C_f, R^*)$ as in Eq. (6),
Eq. (7) can have at most two roots. The optimal retry duration $r_1^*(x, C_f)$ can be
obtained by examining $V_2(x, C_f, R)$ at the boundaries, $r_1 = 0$ and $r_1 = \infty$,
and at the roots of Eq. (7). Note that $r_1^*$ cannot be infinite as long as $p_p > 0$.
Unlike $r_2^*$, $r_1^*$ does not have to be zero when $x \ge x_2^*$. Several cases of
$V_2(x, C_f, R)$ as a function of $r_1$ are shown in Figure 5, where all parameters are
normalized with respect to $x_0$. Case 2 in Figure 5 shows an example for which two
positive roots of Eq. (7) exist. Figure 6 presents some numerical results on
$r_1^*(x, C_f)$ as a function of $x$. Note that $x_2^*$ depends upon the ratio of $\nu$ to
$\mu$, whereas $r_1^*$ varies as $p_t$, $p_i$, and $p_p$ change.
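
The exponential case lends itself to a short numerical check. The sketch below is an
illustration under stated assumptions, not the authors' procedure: it takes restart as the
backup recovery ($a(x) = x_0$), approximates $V_1(x)$ by $x$, obtains the threshold
`x2_star` by setting $E[T_i^a] + V_4 = x_0 + t_s$, assumes that threshold is positive, and
finds $r_1^*$ by a plain grid search instead of root finding. All function names and
parameter values are hypothetical.

```python
import math


def x2_star(x0, ts, mu, nu):
    """Threshold residual computation below which retry runs until benign."""
    return mu / (mu + nu) * (x0 + ts - 1.0 / mu)


def v4_star(x, x0, ts, mu, nu):
    """V4 under the optimal r2 policy; assumes x2_star(...) > 0."""
    xs = x2_star(x0, ts, mu, nu)
    if x < xs:
        return (1.0 + nu / mu) * x
    return x0 + ts + 1.0 / nu - math.exp(-nu * (x - xs)) * (1.0 / mu + 1.0 / nu)


def partial_mean(r, rate, c):
    """Integral of (t + c) dF(t) over [0, r] for an exponential F."""
    surv = math.exp(-rate * r)
    return (1.0 - surv) / rate - r * surv + c * (1.0 - surv)


def v2(r1, x, x0, ts, pt, pi, tau, mu, nu):
    """Expected completion time after a new fault, with retry bound r1."""
    v4 = v4_star(x, x0, ts, mu, nu)
    # Probability that the retry is still failing at r1 (includes permanent faults).
    fail = 1.0 - pt * (1.0 - math.exp(-tau * r1)) - pi * (1.0 - math.exp(-mu * r1))
    return (pt * partial_mean(r1, tau, x) + pi * partial_mean(r1, mu, v4)
            + fail * (x0 + r1 + ts))


def dv2_dr1(r1, x, x0, ts, pt, pi, tau, mu, nu):
    """Closed-form derivative of v2 with respect to r1."""
    pp = 1.0 - pt - pi
    v4 = v4_star(x, x0, ts, mu, nu)
    return (pp + pt * math.exp(-tau * r1) * (1.0 - (x0 + ts - x) * tau)
            + pi * math.exp(-mu * r1) * (1.0 - (x0 + ts - v4) * mu))


def r1_star(x, x0, ts, pt, pi, tau, mu, nu, r_max=5.0, steps=5000):
    """Grid search for the retry bound that minimizes v2 on [0, r_max]."""
    grid = [r_max * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda r: v2(r, x, x0, ts, pt, pi, tau, mu, nu))


params = dict(x0=1.0, ts=0.1, pt=0.6, pi=0.3, tau=20.0, mu=10.0, nu=2.0)
best_r1 = r1_star(0.5, **params)
```

A grid search is used here only for brevity; since the derivative has at most two roots,
comparing $V_2$ at those roots and at the boundaries would give the same answer with
fewer evaluations.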


4. OPTIMAL RETRY POLICY AND PARAMETER ESTIMATION 

In Section 3, we have derived an optimal retry policy for a given fault characteristic
$C_f$. It is, however, very difficult in practice to know the fault characteristic a
priori. Even if the fault characteristic is measured during device manufacture, it may
well vary as the execution environment and the executing tasks change. Another factor
that makes the fault characteristics time-variant is the aging of components, e.g., the
bathtub curve of the failure rate as a function of time [1]. Thus, it is important to
determine an optimal retry policy for uncertain fault characteristics. Note that retry not
only provides an efficient recovery of task execution, but also monitors the behavior of a
fault present in a hardware unit. Naturally, it is desirable to integrate the estimation
of the fault characteristics and the control of the maximum retry durations into a single
decision problem. In such a case, the computer system has to adjust its retry policy using
the information on the fault behavior collected during its past retries.

The detection mechanisms^6 can be useful in estimating the duration between two
successive fault occurrences or the benign duration of an intermittent fault. Note that
this information is crucial in specifying the behavior of fault occurrence or
reappearance. Consequently, the fault characteristics would become well-defined if a good
estimator were used. Moreover, retry may lead to an indication of the active duration of a
transient or intermittent fault, which is, on the other hand, affected by the retry policy
applied. The information collected is incomplete in case of an unsuccessful retry, since
the retry is stopped while the associated fault is active. In what follows, we consider
the estimation of the characteristic of an active fault and the simultaneous determination
of an optimal retry policy which minimizes the task completion time.

Note that the probabilities of having a permanent, transient, or intermittent fault are
crucial to the determination of $r_1^*(x, C_f)$ but unrelated to that of
$r_2^*(x, C_f)$. It implies that correlations among successive retry durations during the
execution of a task do not depend on these probabilities. Thus, to minimize the task
completion time, it is assumed that these probabilities are determined a priori from the
previous observations of fault occurrence. These probabilities can be estimated
accurately if a sufficiently large amount of data on fault occurrence has been collected.
If a sufficient number of samples has not yet been obtained, measured results such as
those in [2,3] have to be used instead.

^6 As was pointed out, we mean here the signal-level detection of faults [4].

Recall that in the determination of $r_2^*$, the transient fault is excluded. Also, if
$p_t$, $p_i$, and $p_p$ are determined a priori, the effect of retry on the task
completion time is a linear combination of the effects of transient and intermittent
faults when a new fault is detected. This implies that the same technique can essentially
be used to deal with the unknown parameters for both transient and intermittent faults.
Consequently, we confine ourselves to the case where the density function of the active
durations of transient faults is known and the active durations of intermittent faults
have the density function form $f_i^a(t \mid \theta)$ with the parameter $\theta$ unknown.
(Note that $\theta$ could be a vector if there are two or more parameters, e.g., the shape
parameter $\alpha$ and the scale parameter $\beta$ for the Weibull and gamma
distributions.)

The samples obtained from retry can be represented by a 2-tuple $(I, t)$ where $I$ is a
single-bit flag and $t$ indicates a duration. $I = 0$ represents a successful retry, and
hence $t$ indicates the active duration of the fault. On the other hand, when a retry
fails, $I = 1$ and $t$ is the retry duration. Let $(I_i, t_i)$, $i = 1, 2, \ldots, n$,
denote the past samples related to the active duration of an intermittent fault. These
resulting samples are type I progressively censored, following Cohen's definition in [13],
with continuous censoring times. There are several different types of estimators
conceivable for estimating the parameter $\theta$ on the basis of these progressively
censored samples. For the Weibull and gamma distributions, the maximum likelihood
estimators have been widely studied, as in [13-17], when the samples are progressively
censored. For simplicity (but not because of difficulty), we shall employ the maximum
likelihood estimator $\hat\theta$ of $\theta$ in the sequel.
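
To make the sample format concrete, the sketch below computes a maximum likelihood
estimate from censored retry samples. It is a simplified illustration, not the estimator
studied in [13-17]: it assumes an exponential active duration with unknown rate, for which
the estimate has a closed form (number of successful retries divided by the total observed
duration), and it assumes at least one successful retry. The function names and sample
values are hypothetical.

```python
import math


def log_likelihood(rate, samples):
    """Log-likelihood of censored samples (flag, t) for an exponential rate.

    flag 0: retry succeeded, t is the observed active duration (density term).
    flag 1: retry failed, the fault outlived t (survival term).
    """
    total = 0.0
    for flag, t in samples:
        if flag == 0:
            total += math.log(rate) - rate * t  # log f(t | rate)
        else:
            total += -rate * t                  # log(1 - F(t | rate))
    return total


def mle_rate(samples):
    """Closed-form maximizer of log_likelihood; needs at least one success."""
    successes = sum(1 for flag, _ in samples if flag == 0)
    total_time = sum(t for _, t in samples)
    return successes / total_time


samples = [(0, 0.2), (0, 0.4), (1, 1.0), (0, 0.4)]
rate_hat = mle_rate(samples)
```

The failed retry contributes its full duration to the denominator but nothing to the
numerator, which is how censoring pulls the estimated rate down.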
When the fault is still active even after the current retry has been applied for the
duration $r$, we shall have collected an additional sample $(1, r)$ via the current
retry. Let $\hat\theta(r)$ be the maximum likelihood estimator of $\theta$ which is based
on the samples including up to the current sample $(1, r)$. Following the current retry
duration $r$, the likelihood function becomes

$$L(\theta) = \eta(1, r, \theta) \prod_{i=1}^{n} \eta(I_i, t_i, \theta) \qquad (8)$$

where $\eta(I, t, \theta)$ is defined as:

$$\eta(I, t, \theta) = \begin{cases} f_i^a(t \mid \theta) & \text{if } I = 0 \\ 1 - F_i^a(t \mid \theta) & \text{if } I = 1 \end{cases}$$

The maximum likelihood estimator $\hat\theta(r)$ should maximize $L(\theta)$ or
$\log L(\theta)$.

Let the optimal retry durations based on the estimated $\hat\theta(r)$ be denoted by
$r_k^*(x, \hat\theta(r))$, $k = 1, 2$, for a newly detected fault and an old intermittent
fault, respectively. Use the notation $C_f(\hat\theta(r))$ to indicate that the active
durations of intermittent faults have the density function
$f_i^a(t \mid \hat\theta(r))$, and let $R(r)$ denote the policy that the maximum allowable
retry duration for the current retry is $r$. Then, the direct solution of the optimal
retry duration is to find the minimum of $V_k(x, C_f(\hat\theta(r)), R(r))$,
$k = 1, 2, 3, 4$. Notice that the retry duration $r$ not only appears in the integral
equations (2) and (3), but also affects the fault characteristic $C_f$.

A 

Under certain conditions, it can be proven that rj!(x,0(r)) is a non-increasing func- 
tion of r. We will first derive the results under such conditions, the application of which 
to a more general ease is then discussed later in this section. For the former ease, the 
optimal retry duration r[ for the current retry can be readily obtained by the following 
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theorem. 


Theorem 2: When (a) the active duration of an intermittent fault has the density
function $f_i^a(t|\hat{\theta})$, and (b) for $t_j > t_k$ the ratio

$$\frac{f_i^a(t|\hat{\theta}(t_j))}{f_i^a(t|\hat{\theta}(t_k))}$$

(a non-decreasing likelihood ratio [18]) is non-decreasing in $t$, then the optimal retry duration is determined by:

$$r_k^* = \inf \{\, r \mid r_k^*(x,\hat{\theta}(r)) \le r \,\} \tag{9}$$

To prove this theorem, we need the following three lemmas. 

Lemma 2. Under the same conditions as in Theorem 2, let $T_j$ and $T_k$ be random
variables with the density functions $f_i^a(t|\hat{\theta}(t_j))$ and $f_i^a(t|\hat{\theta}(t_k))$, respectively, and $\Psi(t)$ be a
non-decreasing function of $t$, then $E[\Psi(T_j)] \ge E[\Psi(T_k)]$ provided $t_j > t_k$.

Proof of this lemma follows immediately from Lemma 2 of Chapter 3 in [18].


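Let $h_i^a(t|\hat{\theta}(t_j))$ be the hazard rate function when the density function of $T_i^a$ is
$f_i^a(t|\hat{\theta}(t_j))$. The following lemma gives the ordering of $h_i^a(t|\hat{\theta}(t_j))$ with respect to $t_j$.

Lemma 3. Under the same conditions as in Theorem 2, $h_i^a(t|\hat{\theta}(t_j))$ is a non-increasing
function of $t_j$ for every fixed $t$.

Proof: For $t_j > t_k$, we have

$$\frac{f_i^a(t|\hat{\theta}(t_j))}{f_i^a(t|\hat{\theta}(t_k))} \le \frac{f_i^a(u|\hat{\theta}(t_j))}{f_i^a(u|\hat{\theta}(t_k))}$$

for all $u > t$. This inequality implies

$$\frac{f_i^a(t|\hat{\theta}(t_j))}{f_i^a(t|\hat{\theta}(t_k))} \le \frac{\int_t^\infty f_i^a(u|\hat{\theta}(t_j))\,du}{\int_t^\infty f_i^a(u|\hat{\theta}(t_k))\,du} = \frac{1 - F_i^a(t|\hat{\theta}(t_j))}{1 - F_i^a(t|\hat{\theta}(t_k))}$$

Thus, $h_i^a(t|\hat{\theta}(t_j)) \le h_i^a(t|\hat{\theta}(t_k))$ if $t_j > t_k$.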


Q.E.D.

Let $V_k^*(x,\hat{\theta}(t)) = \min_R V_k(x,\hat{\theta}(t),R)$, $k=2,3,4$, where $\hat{\theta}(t)$ is used in place of $C_f(\hat{\theta}(t))$.
Note in this case that the active duration of the intermittent fault is distributed with
the parameter $\hat{\theta}(t)$ and that all the other distributions are known.

Lemma 4. Under the same conditions defined as in Theorem 2, if $t_1 > t_2$ then

(i) $V_k^*(x,\hat{\theta}(t_1)) \ge V_k^*(x,\hat{\theta}(t_2))$ for $k=2,3,4$,

(ii) $r_k^*(x,\hat{\theta}(t_1)) \le r_k^*(x,\hat{\theta}(t_2))$ for $k=1,2$.

Proof: The proof for $k=3,4$ is done by mathematical induction. Let $V_{k,n}(x,\hat{\theta}(t_j),r_2(n,j))$,
$k=3,4$, be the expected times needed to complete the residual computation $x$ when there
are at most $n$ retries to be attempted following the current one, and let $r_2(n,j)$ be the
maximum retry duration allowed. Also, let the optimal retry duration to achieve the
minimum $V_{k,n}^*(x,\hat{\theta}(t_j))$ be $r_2^*(n,j)$. For $n=0$, $V_{4,0}^*(x,\hat{\theta}(t_j)) = x$ and

$$V_{3,0}^*(x,\hat{\theta}(t_1)) - V_{3,0}(x,\hat{\theta}(t_2),r_2^*(0,1)) = \int_0^\infty \Phi(t,x,r_2^*(0,1)) \{ f_i^a(t|\hat{\theta}(t_1)) - f_i^a(t|\hat{\theta}(t_2)) \}\,dt$$

where $\Phi(t,x,y) = t + x$ when $t \le y$ and $y + x_0 + t_s$ when $t > y$. Since $\Phi(t,x,r_2^*)$ is non-decreasing
in $t$, the right-hand side of the above equation is non-negative as a result of Lemma 2.
Also, since $V_{3,0}^*(x,\hat{\theta}(t_2))$ is the minimum when the active duration of the intermittent
fault has the density function $f_i^a(t|\hat{\theta}(t_2))$, then

$$V_{3,0}^*(x,\hat{\theta}(t_1)) \ge V_{3,0}(x,\hat{\theta}(t_2),r_2^*(0,1)) \ge V_{3,0}^*(x,\hat{\theta}(t_2)).$$

Suppose that $V_{3,n}^*(x,\hat{\theta}(t_1)) \ge V_{3,n}^*(x,\hat{\theta}(t_2))$ and $V_{4,n}^*(x,\hat{\theta}(t_1)) \ge V_{4,n}^*(x,\hat{\theta}(t_2))$ for all $x$
provided $t_1 > t_2$. It is obvious to see from Eq. (4) that $V_{4,n+1}^*(x,\hat{\theta}(t_1)) \ge V_{4,n+1}^*(x,\hat{\theta}(t_2))$ for
all $x$. Thus,

$$V_{3,n+1}^*(x,\hat{\theta}(t_1)) - V_{3,n+1}(x,\hat{\theta}(t_2),r_2^*(n+1,1)) \ge \int_0^\infty \Phi(t,V_{4,n+1}^*(x,\hat{\theta}(t_1)),r_2^*(n+1,1)) \{ f_i^a(t|\hat{\theta}(t_1)) - f_i^a(t|\hat{\theta}(t_2)) \}\,dt$$

$\Phi(t,V_{4,n+1}^*(x,\hat{\theta}(t_1)),r_2^*(n+1,1))$ is non-decreasing in $t \le r_2^*(n+1,1)$. Also, since $r_2^*(n+1,1)$ is
the optimal retry duration, $V_{4,n+1}^*(x,\hat{\theta}(t_1)) \le x_0 + t_s$. Hence, $\Phi(t,V_{4,n+1}^*(x,\hat{\theta}(t_1)),r_2^*(n+1,1))$ is
always non-decreasing in $t$. The right-hand side of the above equation becomes non-negative,
resulting in $V_{3,n+1}^*(x,\hat{\theta}(t_1)) \ge V_{3,n+1}(x,\hat{\theta}(t_2),r_2^*(n+1,1)) \ge V_{3,n+1}^*(x,\hat{\theta}(t_2))$. By
mathematical induction, we have $V_k^*(x,\hat{\theta}(t_1)) \ge V_k^*(x,\hat{\theta}(t_2))$ for $k=3,4$.

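To prove $r_2^*(x,\hat{\theta}(t_1)) \le r_2^*(x,\hat{\theta}(t_2))$, the following cases are examined. When
$r_2^*(x,\hat{\theta}(t_1)) = 0$, the relation is always true. When $r_2^*(x,\hat{\theta}(t_1)) > 0$, using Lemma 3 and the
first part of this proof, the derivative of $V_3(x,\hat{\theta}(t_j),R)$ with respect to the retry duration $r$
has the following ordering relationship for all $r$ and $t_1 > t_2$:

$$V_4^*(x,\hat{\theta}(t_1)) + \frac{1}{h_i^a(r|\hat{\theta}(t_1))} - \{x_0 + t_s\} \ge V_4^*(x,\hat{\theta}(t_2)) + \frac{1}{h_i^a(r|\hat{\theta}(t_2))} - \{x_0 + t_s\}$$

where all retries after the current one are assumed to employ the optimal policy. Thus,
for $t_1 > t_2$, $r_2^*(x,\hat{\theta}(t_2)) = \infty$ when $r_2^*(x,\hat{\theta}(t_1)) = \infty$, and $r_2^*(x,\hat{\theta}(t_2)) \ge r_2^*(x,\hat{\theta}(t_1))$ when $r_2^*(x,\hat{\theta}(t_1))$ is
finite.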

For the case of $k=2$, it is easy to see that $V_2(x,\hat{\theta}(t_j),R)$ is a linear combination of
the effects of both transient and intermittent faults. Thus, $V_2(x,\hat{\theta}(t_1),R^*) \ge V_2(x,\hat{\theta}(t_2),R^*)$.
Also, the handling of $V_2$ with respect to $r_1$ has the same ordering relationship as that of
$V_3$ with respect to $r_2$. Thus, $V_2^*(x,\hat{\theta}(t_1)) \ge V_2^*(x,\hat{\theta}(t_2))$, and $r_1^*(x,\hat{\theta}(t_1)) \le r_1^*(x,\hat{\theta}(t_2))$ when
$t_1 > t_2$. Q.E.D.


Lemma 4 shows that $r_k^*(x,\hat{\theta}(t_j))$ is non-increasing in $t_j$ for $k=1,2$. Thus, there exists
an $r$ such that $r \ge r_k^*(x,\hat{\theta}(r))$. The proof of Theorem 2 is given as follows:

Proof of Theorem 2: Suppose that the retry has been applied for the period $r$ but the
fault is still active. When $r_k^*(x,\hat{\theta}(r)) > r$, the retry should be continued since it decreases
the expected task completion time. Thus, $r_k^* \ge \sup \{ r \mid r_k^*(x,\hat{\theta}(r)) > r \}$. Suppose there is
an $r_j \in \{ r \mid r_k^*(x,\hat{\theta}(r)) \le r \}$. Then $V_{k'}(x,\hat{\theta}(r_j),r_j) \ge V_{k'}^*(x,\hat{\theta}(r_j)) \ge V_{k'}^*(x,\hat{\theta}(r_k^*))$ where $r_k^*$ is
defined in Eq. (9) and $k' = k+1$. Thus, the theorem follows. Q.E.D.
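The stopping rule of Theorem 2 can be evaluated numerically once the map from the elapsed retry time $r$ to the optimal duration under the current estimate is available. The sketch below assumes that map is supplied as a non-increasing function (as Lemma 4 provides) and locates the infimum by bisection. The names `stopping_time` and `optimal_duration`, and the search bound `upper`, are hypothetical.

```python
import math

def stopping_time(optimal_duration, upper, tol=1e-9):
    """Smallest r in [0, upper] with optimal_duration(r) <= r.

    optimal_duration(r) stands for the optimal retry duration computed from
    the estimate at elapsed time r. It is assumed non-increasing in r, so
    once the condition holds at some r it holds for every larger r.
    """
    if optimal_duration(0.0) <= 0.0:
        return 0.0               # retry is not worthwhile at all
    if optimal_duration(upper) > upper:
        return math.inf          # no stopping point found below `upper`
    lo, hi = 0.0, upper          # invariant: condition fails at lo, holds at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if optimal_duration(mid) <= mid:
            hi = mid
        else:
            lo = mid
    return hi

# Smooth example: 2 / (1 + r) crosses the diagonal at r = 1.
smooth = stopping_time(lambda r: 2.0 / (1.0 + r), upper=10.0)
# Step example: the optimal duration is infinite until r = 3, then zero.
step = stopping_time(lambda r: math.inf if r < 3.0 else 0.0, upper=10.0)
```

The step example mirrors the situation where the optimal duration for a known parameter is either zero or infinite, so the estimate alone decides when the retry stops.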


For the same example in Section 3.2, suppose the active duration of an intermittent
fault is exponentially distributed with an unknown disappearance rate $\mu$. Using a
method similar to Cohen's derivation in [13], the maximum likelihood estimator $\hat{\mu}(r)$
for an exponential distribution, which maximizes $\log L(\mu)$, is obtained as:

$$\hat{\mu}(r) = \frac{n - \sum_{i=1}^{n} I_i}{\sum_{i=1}^{n} t_i + r} \tag{10}$$
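A minimal sketch of this estimator follows: the number of successful retries divided by the total observed time, including the current unfinished retry. The function name and the sample values are our own illustrative choices.

```python
def exponential_rate_mle(samples, current_retry):
    """Successful retries divided by total observed time, current retry included."""
    successes = sum(1 for flag, _ in samples if flag == 0)
    observed_time = sum(duration for _, duration in samples) + current_retry
    return successes / observed_time   # requires a positive total observed time

past = [(0, 1.0), (1, 2.0), (0, 3.0)]                          # hypothetical (I, t) samples
estimate_early = exponential_rate_mle(past, current_retry=0.0)  # 2 / 6
estimate_late = exponential_rate_mle(past, current_retry=4.0)   # 2 / 10
all_censored = exponential_rate_mle([(1, 2.0), (1, 1.0)], current_retry=1.0)
```

The estimate falls as the current retry drags on without success, which is the monotone behavior Theorem 2 relies on. When every sample is censored the estimate is zero.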


Theorem 2 gives the optimal stopping time for the current retry. Note that the
true value of $\mu$ is unknown and its maximum likelihood estimator is used to determine the
optimal retry duration. In the case of retry for a reappearing intermittent fault, the
optimal retry duration for a given $\mu$ is either 0 or $\infty$ as shown in Corollary 1. Using
Theorem 2 and Eq. (10), we get the optimal retry duration as follows:



Note that the gamma distribution has a non-decreasing likelihood ratio for both $\alpha$
and $\beta$ [18]. Furthermore, the estimators provided by Cohen [15] show that both the
estimated $\alpha$ and $\beta$ are increasing in the current retry period $r$. Thus, Theorem 2 can be
applied directly when the active duration of the intermittent fault has the gamma distribution.
When the distribution of the active duration is Weibull, Theorem 2 cannot be
applied directly, but still provides a good approximate solution. This is due to the fact
that the Weibull distribution has a non-decreasing likelihood ratio with respect to its
scale parameter only. Since (i) the variation of the estimated shape parameter $\alpha$ with
respect to the current retry duration $r$ is always less than that of the estimated scale
parameter $\beta$, and (ii) the estimated $\beta$ is increasing in $r$ when $\alpha$ is fixed, a reasonably
good approximation can be obtained by assuming that $\alpha$ is constant during the current
retry and $\beta$ is estimated using both the past and current samples as discussed
above.

There are some shortcomings when the maximum likelihood estimator is used for
the progressively censored samples. Particularly, the estimator is biased when the samples
are censored. Also, in the case of the exponential distribution, $\hat{\mu}$ does not contain
sufficient statistics of $\mu$ when the samples are censored and incomplete, i.e., when there
exists at least one sample $(I_i, t_i)$ with $I_i = 1$. These shortcomings can be seen easily from
a trivial example: $\hat{\mu}$ becomes zero when $I_i = 1$ for all $i=1,2,\ldots,n$. In fact, as shown by van
Zwet [10], for most practical cases it is impossible to obtain unbiased estimators when
the samples are Type I censored in a semi-infinite interval. Note, however, that there is
no restriction about which estimator to be used in the foregoing determination of the
optimal retry policy, meaning that estimators other than the maximum likelihood estimator
can be used without altering our method described thus far.

5. BAYES SEQUENTIAL ANALYSIS AND OPTIMAL RETRY 

In the previous section, the unknown parameters of a distribution are estimated 
first, and the optimal retry policy is then determined using the estimated results. In this 
section, the same problem is attacked by taking the Bayes approach. Since the reap- 
pearances of an intermittent fault during the execution of a task are a renewal process, 
there could be a sequence of retries for the same intermittent fault. Thus, it is natural 




Lee and Shin


May 4, 1984


to incorporate the Bayes sequential analysis to both characterize the intermittent fault
and determine the optimal retry policy. Retry is then considered as sampling of the
fault behavior, and the retry duration controls both the task completion time and the
information to be collected. Since the retry for a newly detected fault only occurs once
within the execution of a task (under the assumption that the expected inter-failure duration
is much larger than the other durations), we focus on the behavior of intermittent faults
and thus the determination of $r_2^*$.

5.1. Optimal Retry and Bayes Decision 

Let the distribution of $T_i^a$ be governed by some unknown parameters $W_i$. Note
that $W_i$ may be a scalar (e.g. for the exponential distribution) or a vector (e.g. for the
gamma or Weibull distribution). The a priori information concerning $W_i$ is expressed in
terms of a probability distribution function defined on $\Omega$. Let the density function of $W_i$
be $\xi_i(w)$. Denote further the fault characteristics, given $w_i$ and the prior density function
$\xi_i$, by $C_{f|w_i}$ and $C_{f|\xi_i}$, respectively.

To apply the Bayes decision theory, the risk with a retry policy $R$, given $\xi_i$ and the
residual computation $x$, is defined as follows:

*=3,4 (12) 

Thus, the (optimal) Bayes risk is given as

$$\rho_k^*(x,\xi_i) = \inf_R \rho_k(x,\xi_i,R), \quad k=3,4 \tag{13}$$

Since we are now concerned only with the retry for an intermittent fault, $R$ consists of $r_2$
only. The optimal retry duration in case of the detection of an old intermittent fault,
$r_2^*(x,C_{f|\xi_i})$, abbreviated by $r_2^*(x,\xi_i)$, yields the Bayes risk $\rho_3^*(x,\xi_i)$. Similarly, the Bayes
risk of the retry for a newly detected fault can be defined with Eqs. (12) and (13). However,
the determination of $r_1^*(x,\xi_i)$ is a one stage Bayes decision problem. Once $\rho_3^*(x,\xi_i)$
and $\rho_4^*(x,\xi_i)$ are obtained, the normal form of analysis [22] can be applied directly for the
solution of $r_1^*(x,\xi_i)$.


Following a retry attempt for an intermittent fault, regardless of whether it fails or
succeeds, an event related to the fault active duration $T_i^a$ is observed. The event
observed during a retry of the duration $r$ is either "success" or "fail". The "success"
event, denoted by $e^s(t)$, occurs when the detected fault disappears after the retry duration
$t$ which is less than or equal to the maximum allowable retry duration $r$. The "fail"
event, denoted by $e^f(r)$, occurs when the detected fault does not disappear by the end of
the retry duration $r$. Let $S(r) = \{ e^s(t);\ t \le r \} \cup \{ e^f(r) \}$. With the prior density function
$\xi_i(w)$, the posterior density function following the observation of $e \in S(r)$, denoted by
$\xi_i(w|e)$, $i=1,2$, becomes

$$\xi_i(w|e) = \frac{g(e|w)\,\xi_i(w)}{\int_\Omega g(e|w)\,\xi_i(w)\,dw} \tag{14}$$

where $g(e|w)$ is the generalized conditional density function for the event $e$ as in [20], i.e.,

$$g(e|w) = \begin{cases} \text{the density function of } T_i^a \text{ at } t & \text{if } e = e^s(t) \text{ and } t \le r \\ \text{the probability that } T_i^a \text{ exceeds } r & \text{if } e = e^f(r) \end{cases} \tag{15}$$

This posterior density function will become the prior density for the next retry. Conse- 
quently, the system’s behavior is similar to a sequential decision procedure which deter- 
mines first a retry policy and then observes the resulting sample. The procedure will be 
repeated with a new prior distribution which is determined on the basis of the new sam- 
ple observed and the old prior distribution. The decision on retry and the sampling for 
fault characterization will continue as long as there is an occurrence of fault. 
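The prior-to-posterior cycle described above can be sketched on a discrete grid of candidate parameter values. The example assumes an exponentially distributed active duration whose rate is the unknown parameter. The grid, the uniform prior and the function name are hypothetical, and the sum over the grid stands in for the integral over the parameter space.

```python
import math

def posterior_on_grid(grid, prior, event, duration):
    """Bayes update of a discrete prior over candidate rates after one retry.

    event is "success" (fault disappeared at `duration`) or "fail"
    (fault still active after a retry of length `duration`).
    """
    if event == "success":
        likelihood = [w * math.exp(-w * duration) for w in grid]  # density at t
    else:
        likelihood = [math.exp(-w * duration) for w in grid]      # Prob(T > r)
    weighted = [l * p for l, p in zip(likelihood, prior)]
    total = sum(weighted)            # discrete stand-in for the normalizing integral
    return [v / total for v in weighted]

grid = [0.5, 1.0, 2.0]               # hypothetical candidate disappearance rates
prior = [1 / 3, 1 / 3, 1 / 3]
after_fail = posterior_on_grid(grid, prior, "fail", 2.0)
after_both = posterior_on_grid(grid, after_fail, "success", 0.1)  # posterior reused as prior
```

A failed retry shifts weight toward the slow rates, and feeding `after_fail` back in as the prior for the next observation is the sequential step the text describes.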



The problem of selecting the optimal retry policy can also be treated as the optimal
stopping problem with continuous observations [21]. Suppose an intermittent fault is
detected again when the residual computation is $x$. Then, retry is applied for a specified
stopping time $r$. The task will be continued, without applying recovery methods other
than retry, if the fault disappears during the retry period $r$. Otherwise, it has to be restarted
from the beginning. 7 The posterior density function of $w_i$ becomes $\xi_i(w|e^s(t))$ or
$\xi_i(w|e^f(r))$, depending on the outcome of retry. The cost of an observation is the amount
of time used for monitoring the fault until its disappearance, i.e., $c(e^s(t)) = t$, or until the
end of retry, i.e., $c(e^f(r)) = r$. The costs associated with the termination of retry are
defined as the amount of time necessary to complete the residual computation $x$ as follows:

$$L(x,r,\xi_i|e^s(t)) = \rho_4^*(x,\xi_i(w|e^s(t)))$$
$$L(x,r,\xi_i|e^f(r)) = x_0 + t_s$$

The expected loss for the stopping time $r_2^*$ is the same as the Bayes risk defined in
Eq. (13). According to the theory presented by Irle in [21], there always exists an
optimal stopping time, $r_2^* \in [0, \infty)$, satisfying Eq. (13).

We will in the next section solve the sequential decision problem using the backward
induction [20] for testing hypotheses where the prior and posterior distributions are
confined within the open unit interval, i.e., (0,1). Note that the minimax method in [22]
cannot be used to solve Eqs. (12) and (13), since the decision space, which consists of
all possible maximum retry durations, is neither compact nor finite.


7 For simplicity, it is assumed that there is only one alternative to the retry recovery, i.e., restart.

5.2. Optimal Retry and Hypotheses Testing

Suppose that there are a primary and some alternative hypotheses concerning the
characteristics of an intermittent fault. Consider the sequential testing of these
hypotheses and the simultaneous determination of the optimal retry policy; this is not
difficult to solve since both the prior and posterior probabilities lie in the same unit
interval (0,1). For given hypotheses, the initial prior distribution can be assumed to be
equally likely among the hypotheses.

To be more specific, take a demonstrative example 8 in which the active duration of
an intermittent fault is assumed to be exponentially distributed with an unknown
parameter $\mu$. Let there be two hypotheses on $\mu$, $H_0$ and $H_1$ for $\mu=\mu_0$ and $\mu=\mu_1$, respectively,
and let $\mu_0 > \mu_1$. The uncertainty associated with these hypotheses can be
represented by the probability $h$ of having $\mu=\mu_0$. We will first determine the optimal
retry policy for all $h \in (0,1)$. Then, we will consider the problem of testing hypotheses as
well as estimating the expected sample size to reach a certain significance level under the
optimal retry policy.

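Consider the optimal retry duration $r_2^*(x,h)$ upon detection of an old intermittent
fault. In this case, we get the posterior probabilities given the events $e^s(t)$ and $e^f(r)$,
denoted by $h(t)$ and $h(r)$, respectively, as follows:

$$h(t) = \frac{h\mu_0 e^{-\mu_0 t}}{h\mu_0 e^{-\mu_0 t} + (1-h)\mu_1 e^{-\mu_1 t}} \quad \text{where } t \le r \tag{16}$$

$$h(r) = \frac{h e^{-\mu_0 r}}{h e^{-\mu_0 r} + (1-h) e^{-\mu_1 r}} \tag{17}$$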
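For the two-hypothesis exponential example, the posterior probability of $H_0$ follows from Bayes' theorem with the density at $t$ for a successful retry and the survival probability at $r$ for a failed one. The sketch below assumes that this is what Eqs. (16) and (17) express. The function names and numerical values are illustrative only.

```python
import math

def posterior_after_success(h, mu0, mu1, t):
    """Probability of H0 (rate mu0) after the fault disappears at time t."""
    a = h * mu0 * math.exp(-mu0 * t)
    b = (1.0 - h) * mu1 * math.exp(-mu1 * t)
    return a / (a + b)

def posterior_after_failure(h, mu0, mu1, r):
    """Probability of H0 (rate mu0) after a retry of length r fails."""
    a = h * math.exp(-mu0 * r)
    b = (1.0 - h) * math.exp(-mu1 * r)
    return a / (a + b)

h_success = posterior_after_success(0.5, mu0=2.0, mu1=1.0, t=0.0)  # 2/3
h_failure = posterior_after_failure(0.5, mu0=2.0, mu1=1.0, r=1.0)  # 1/(1+e)
```

With $\mu_0 > \mu_1$, an immediate disappearance raises the probability of $H_0$, while a failed retry lowers it, and the longer the failed retry the lower it falls.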


8 As will be pointed out near the end of this section, the results obtained from this example are
applicable/extendable to more general cases.

As was discussed in Section 3.2, we can compute $x_2^*$ for a given $\mu_i$, denoted by
$x_2^*(\mu_i)$, $i=0,1$, such that (i) $r_2^*(x,1) = \infty$ if $x < x_2^*(\mu_0)$, or 0 otherwise, and (ii) $r_2^*(x,0) = \infty$ if
$x < x_2^*(\mu_1)$, or 0 otherwise. Since $x_2^*(\mu_0) > x_2^*(\mu_1)$, $r_2^*(x,h) = \infty$ if $x < x_2^*(\mu_1)$, and $r_2^*(x,h) = 0$ if
$x > x_2^*(\mu_0)$. Note that the above represents extreme cases of retry, i.e., retries of duration
zero or infinite.

For the non-extreme case, i.e., the case of $x_2^*(\mu_1) < x < x_2^*(\mu_0)$, let
$h^* = \sup \{ h \mid r_2^*(x,h) = 0 \}$. Since $r_2^*(x,1) = \infty$ and $r_2^*(x,0) = 0$ for $x_2^*(\mu_1) < x < x_2^*(\mu_0)$, we get
$0 < h^* < 1$. For all $h > h^*$, $r_2^*(x,h) > 0$, i.e., retry must be applied upon detection of a fault.
Suppose retry has been applied for a small duration $\delta r < r_2^*(x,h)$. Then, the memoryless
property of the exponential distribution leads to the following equation:

$$\rho_3^*(x,h) = (1 - F_i^a(\delta r|h)) (\delta r + \rho_3^*(x,h(\delta r))) + \int_0^{\delta r} f_i^a(t|h) \{ t + \rho_4^*(x,h(t)) \}\,dt \tag{18a}$$


By letting $\delta r \to 0$ and changing variables, Eq. (18a) becomes



(18b) 


On the other hand, $\rho_3^*(x,h) = x_0 + t_s$ for all $h \le h^*$. Using the same approach as in
Theorem 1, we can prove that $h^*$ satisfies the following equation

$$\rho_4^*(x,h^*(0)) = x_0 + t_s - \frac{1}{h^*\mu_0 + (1-h^*)\mu_1} \tag{19}$$


From Eq. (4) and the definition of $\rho_4$ in Eq. (13), $\rho_4^*(x,h)$ is expressed as:

$$\rho_4^*(x,h) = \frac{1}{\nu}(1 - e^{-\nu x}) + e^{-\nu x} \int_0^x \nu e^{\nu y} \rho_3^*(y,h)\,dy \tag{20}$$
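Step A4 of the algorithm below needs $\rho_4^*$ evaluated from $\rho_3^*$ on a grid of $x$. As a sketch of that evaluation, the code below assumes Eq. (20) has the form $\frac{1}{\nu}(1-e^{-\nu x}) + e^{-\nu x}\int_0^x \nu e^{\nu y}\rho_3^*(y,h)\,dy$ for a fixed $h$, with $\nu$ taken to be the reappearance rate of the intermittent fault. The quadrature rule, the function name and the test values are our own choices.

```python
import math

def rho4_from_rho3(rho3, x, nu, steps=2000):
    """Evaluate (1/nu)(1 - e^{-nu x}) + e^{-nu x} * int_0^x nu e^{nu y} rho3(y) dy."""
    width = x / steps
    integral = 0.0
    for i in range(steps):                       # composite trapezoidal rule
        y0, y1 = i * width, (i + 1) * width
        f0 = nu * math.exp(nu * y0) * rho3(y0)
        f1 = nu * math.exp(nu * y1) * rho3(y1)
        integral += 0.5 * (f0 + f1) * width
    return (1.0 - math.exp(-nu * x)) / nu + math.exp(-nu * x) * integral

# Sanity check with a constant risk c, for which the closed form is
# (1/nu + c) * (1 - e^{-nu x}).
c, nu, x = 5.0, 0.5, 2.0
numeric = rho4_from_rho3(lambda y: c, x, nu)
closed_form = (1.0 / nu + c) * (1.0 - math.exp(-nu * x))
```

In an actual sweep over $h$, `rho3` would be an interpolant of the values computed in the preceding step rather than a constant.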

With the initial conditions $r_2^*(x,1) = \infty$, $\rho_3^*(x,h)$ and $\rho_4^*(x,h)$ for $x < x_2^*(\mu_1)$, and Eqs. (18)-(20),
we can calculate $\rho_k^*(x,h)$, $k=3,4$, for all $x \in (x_2^*(\mu_1), x_2^*(\mu_0))$ with the following numerical
algorithm:

A1. Set $h=1$.

A2. Calculate $\rho_3^*(x,1)$ and $\rho_4^*(x,1)$ for all $x \in (x_2^*(\mu_1), x_2^*(\mu_0))$.

A3. Calculate $\partial\rho_3^*(x,h)/\partial h$ using Eq. (18) and $\rho_3^*(x,h-\delta h)$ for all $x \in (x_2^*(\mu_1), x_2^*(\mu_0))$. (Note
$\rho_3^*(x,h)$ and $\rho_4^*(x,h(0))$ are both known.)

A4. Calculate $\rho_4^*(x,h-\delta h)$ using Eq. (20) for all $x \in (x_2^*(\mu_1), x_2^*(\mu_0))$. (Note $\rho_3^*(x,h-\delta h)$ is
known for all $x$.)

A5. Set $h = h - \delta h$. If $h \le 0$, terminate the algorithm.

A6. If $\rho_4^*(x,h(0)) < x_0 + t_s - \frac{1}{h\mu_0 + (1-h)\mu_1}$, go to A3.
Otherwise, set $\rho_3^*(x,h-\delta h) = x_0 + t_s$ and go to A4.

From the test at A6, one can determine $h^*$ for all $x \in (x_2^*(\mu_1), x_2^*(\mu_0))$ so as to satisfy
Eq. (19). Due to the memoryless property of the exponential distribution, $r_2^*(x,h) = 0$
when $h \le h^*$ or satisfies Eq. (17) with $h(r) = h^*$ if $h > h^*$. In Figure 7, $r_2^*$ versus the prior
probability $h$ is plotted for various values of the residual computation $x$. Intersections of
the curves in Figure 7 with the horizontal axis give the values of $h^*$ for different values
of $x$.

Remark 1: In case the active duration of an intermittent fault has a general distribution
(instead of an exponential distribution), a differential equation similar to Eq. (18b)
cannot be obtained. In such a case, the original integral equation of $\rho_3^*(x,\xi_i)$, i.e., the
combination of Eqs. (3) and (12), has to be used instead.

From the foregoing discussion we can determine the optimal retry policy that is
based on the prior probability $h$. Under this optimal retry policy, we can also determine
trajectories of the posterior probabilities after a large number of occurrences and reappearances
of intermittent faults have been observed. Let each retry be numbered by a
two-tuple $(m,n_m)$ on the basis of occurrences and reappearances of intermittent faults.
The $(m,n_m)$-th retry is used to recover from the $m$-th occurrence of fault in case of $n_m = 0$
or from the $n_m$-th reappearance of the $m$-th intermittent fault if $n_m \ne 0$. For the
hypotheses $H_i$, $i=0,1$, let $h_i(m,n_m)$ represent the posterior probability after the $(m,n_m)$-th
retry is applied. Also, let $N_m$ be the total number of reappearances of an intermittent
fault during the execution of a task and $h_i(m)$ be the prior probability before the $m$-th
occurrence of fault, which is equal to $h_i(m-1,N_{m-1})$ by definition. There are now two
main problems to be addressed: (i) Will $h_i(m)$ converge to either 0 or 1, namely to the
true fault characteristic, as $m \to \infty$? (ii) If it converges, how fast will it converge? For
convergence, we get the following theorem.


Theorem 3. Let $M = \inf \{ m \mid h_i(m) \ge 1-\epsilon \text{ or } h_i(m) \le \epsilon \}$ where $0 < \epsilon < 1$. If $0 < h_i(0) < 1$,
and $x_0 + t_s - \frac{1}{\mu_i} > 0$ for all hypotheses $H_i$ and all tasks, then $Prob(M < \infty) = 1$ and
$E[M] < \infty$.


Proof: Let $S_i(m) = \log\frac{h_i(m)}{h_j(m)}$ for $j \ne i$. Thus, $M$ can be defined as $\inf \{ m \mid |S_i(m)| \ge K \}$,
where $K = \log\frac{1-\epsilon}{\epsilon}$. Let $z_i(m,n_m) = \log\frac{g[e(m,n_m)|\mu_i]}{g[e(m,n_m)|\mu_j]}$ where $e(m,n_m)$ is the event
observed at the $(m,n_m)$-th retry and $g(e|\mu_i)$ is the generalized conditional density function
defined in Eq. (15). (When the retry duration defined by a retry policy is zero, $e(m,n_m)$
is null and $z_i(m,n_m) = 0$. Also, when $n_m = 0$, the retry duration is $r_1^*$ since the fault type is
not known at its first occurrence.) From Bayes theorem, we have

$$S_i(m') = S_i(m'-1) + \sum_{n=0}^{N_{m'}} z_i(m',n) = \sum_{m=1}^{m'} \sum_{n=0}^{N_m} z_i(m,n) + \log\frac{h_i(0)}{h_j(0)}$$

Let $y_i(m) = \sum_{n=0}^{N_m} z_i(m,n)$ under the optimal retry policy. $S_i(m)$ becomes the sum of independent
random variables. After an event is observed, the expected value of $z_i(m,n_m)$ is the
Kullback-Leibler information number and is greater than or equal to zero when $H_i$ is
true [23]. In this case, $E[z_i(m,n_m)] = 0$ if and only if the prior probability before the
$(m,n_m)$-th retry is 0 or 1. Since $x_0 + t_s - \frac{1}{\mu_i} > 0$ for all hypotheses $H_i$ and all tasks executed,
$Prob(r_k^* \ne 0) > 0$, $k=1,2$. Hence, $Prob[y_i(m) = 0] < 1$ for all $m \le M$. Following the
proof in [24] that the sampling of a sequential probability-ratio test (SPRT) terminates
with probability 1, $Prob(M < \infty) = 1$ and $E[M] < \infty$ are obtained. Q.E.D.
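The termination behavior in Theorem 3 can be explored with a small simulation of the log-odds random walk for the two-hypothesis exponential example. The sketch below is not the optimal policy of the text: it assumes every retry uses the same fixed maximum duration, and all names and parameter values are hypothetical.

```python
import math
import random

def samples_to_decide(mu_true, mu0, mu1, epsilon, retry_limit, rng,
                      h0=0.5, max_samples=100_000):
    """Count retries until the posterior of H0 leaves (epsilon, 1 - epsilon)."""
    k = math.log((1.0 - epsilon) / epsilon)        # boundary K on the log-odds scale
    s = math.log(h0 / (1.0 - h0))                  # initial log-odds of H0
    for m in range(1, max_samples + 1):
        t = rng.expovariate(mu_true)               # true active duration
        if t <= retry_limit:                       # success: log density ratio at t
            s += math.log(mu0 / mu1) - (mu0 - mu1) * t
        else:                                      # failure: log survival ratio at r
            s += -(mu0 - mu1) * retry_limit
        if abs(s) >= k:
            return m, 1.0 / (1.0 + math.exp(-s))   # posterior probability of H0
    raise RuntimeError("boundary not reached within max_samples")

rng = random.Random(0)                             # fixed seed for repeatability
count, posterior = samples_to_decide(mu_true=2.0, mu0=2.0, mu1=1.0,
                                     epsilon=0.05, retry_limit=1.0, rng=rng)
```

Averaging `count` over many seeds gives an empirical estimate of the expected sample size under the chosen fixed policy, and the fraction of runs ending at the wrong boundary estimates the error probability.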


Remark 2: Since the tasks affected by intermittent faults do not have to be identical,
the random variables $y_i(1), y_i(2), \ldots$ are independently but not identically distributed.
Moreover, for a fixed $m$, $z_i(m,n_m)$'s are dependent on one another because the events
observed are controlled by the retry duration which is in turn a function of the moment
of reappearance. However, all $E[z_i(m,n_m)] \ge 0$ when $H_i$ is true. The condition, $x_0 + t_s - \frac{1}{\mu_i} > 0$
for all hypotheses $H_i$ and all tasks executed, indicates that retry is always a useful
recovery when an intermittent fault is detected. In fact, this condition is not necessarily
true for all tasks, but Theorem 3 holds as long as $Prob(r_k^* > 0) \ne 0$.

Theorem 3 shows that the expected number of faults observed (that makes the
posterior probability reach either $\epsilon$ or $1-\epsilon$) is finite. This also holds for other distributions
and retry policies as long as $r_1^* \ne 0$ and $r_2^* \ne 0$ for some $x$. However, it does not provide
the average sample size, $E[M | H_i]$, that is necessary to reach these termination
boundaries $K$ and $-K$. Also, one has to justify whether or not the posterior probability
at the termination implies the true fault characteristic. In other words, it is important
to know the error probability, $Prob(S_i(M) \le -K | H_i)$.

There are two difficult aspects in the evaluation of $E[M | H_i]$ and
$Prob(S_i(M) \le -K | H_i)$; one is that $y_i(m)$'s are not identically distributed, and the other is
the non-existence of closed form solutions for both $r_1^*$ and $r_2^*$. If the same task is executed
repeatedly under the condition $x_0 + t_s - \frac{1}{\mu_i} > 0$ for all hypotheses, then $y_i(m)$'s
become independently and identically distributed. Assume further that initially, both
hypotheses are equally likely, i.e., $h_0(0) = h_1(0) = 0.5$. Using the characteristics of SPRT
in [20], the error probability is approximated by:

Prob(S,{M)<-K | //.) ~ ~ e K 

6 

Even if the same task is executed repeatedly, it is very difficult to obtain an exact
solution for $E[y_i]$ because of the dependency between the optimal retry durations and the
observed samples of the active durations. This fact in turn makes it impossible to
obtain the exact solution of $E[M | H_i]$. Due to the above difficulties, in what follows, we
will derive upper and lower bounds of $E[M | H_i]$ instead of an exact solution.

Suppose there are two retry policies $R^0$ and $R^1$ with the retry durations $(r_1^0, r_2^0)$ and
$(r_1^1, r_2^1)$, respectively. $r_1^0(x,h)$ and $r_1^1(x,h)$ are defined the same as $r_1^*(x,h)$. $r_2^l(x,h)$ is equal to
$\infty$ if $x < x_2^*(\mu_l)$ and 0 otherwise for $l=0,1$. Let $y_i^l(m)$ and $M^l$ be $\sum_{n=0}^{N_m} z_i(m,n)$ and the
number of faults observed to reach the termination boundaries under the retry policy $R^l$,
respectively. Then, (i) $Prob(M^l < \infty) = 1$ and $E[M^l] < \infty$, and (ii) $E[y_i^1] \le E[y_i] \le E[y_i^0]$.
(Note that the indices $m$ are omitted because of the distributions being identical.) Once
$E[y_i^l | H_i]$, $l=0,1$, is calculated as in the Appendix, the expected sample size to reach the
boundaries $1-\epsilon$ and $\epsilon$ is bounded as:

$$E[M^0 | H_i] \le E[M | H_i] \le E[M^1 | H_i]$$


where E[Af | //,] 



0,1 (see [20] for more on this). 


The above equations give the error probability and the bounds of the expected sample
size when a certain level of significance is to be achieved. These bounds of $E[M | H_i]$
become tight when the difference between $\mu_0$ and $\mu_1$ is small. Of course, the expected
sample size under the optimal retry policy is larger than that for the case when the complete
information about active duration is observed, i.e., $r_1 = r_2 = \infty$.

Thus far, we have discussed solutions to the problem of sequential retry decision
and hypotheses testing only for the case of exponentially distributed durations. Notice,
however, that (i) the same method, with little modification, can be applied to the cases
with any other kind of distributions, and (ii) Theorem 3 holds as long as
$Prob(y_i(m) = 0) < 1$. Moreover, the method can be extended to the testing of multiple
alternative hypotheses by specifying the prior and posterior probabilities as a vector,
each element of which represents the probability that the corresponding hypothesis is
true.


6. CONCLUSION

In this paper, we have investigated optimal retry policies with known and unknown 
fault characteristics. Retry not only saves the recovery overhead but provides a means 
to estimate the unknown characteristics. Although the data resulting from retries are 
censored, they are the only significant means of monitoring the fault characteristics.
Naturally, the monitored results are different from those obtained during device
manufacture.

In the discussion of retry policies, retry durations are assumed to be continuous. In 
fact, the retry durations should be discrete since the time required for repeated execution 
of an operation cannot be cascaded into a single continuous duration. Since the 
expected risk is a continuous function of the retry duration, it is not difficult to find the 
optimal retry policy which is specified as a number of retry attempts. 
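One simple way to pass from a continuous retry duration to a number of attempts is to evaluate the risk at integer multiples of the time per attempt and keep the best one. The sketch below does exactly that. The risk curve, the time per attempt and the cap on attempts are hypothetical placeholders for the quantities of the earlier sections.

```python
def best_attempt_count(risk, attempt_time, max_attempts):
    """Number of attempts n in 0..max_attempts minimizing risk(n * attempt_time)."""
    return min(range(max_attempts + 1), key=lambda n: risk(n * attempt_time))

# Hypothetical continuous risk curve with its minimum at a retry duration of 2.3.
def example_risk(r):
    return (r - 2.3) ** 2 + 10.0

attempts = best_attempt_count(example_risk, attempt_time=1.0, max_attempts=10)
```

When the risk is unimodal in the retry duration, only the two multiples of the attempt time that bracket the continuous optimum need to be compared; the exhaustive scan above avoids relying on that assumption.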

As was pointed out in the discussion of the expected sample size for reaching a cer- 
tain level of confidence in hypotheses testing, the test under the optimal retry policy 
turns out to be inefficient in the sense of maximizing the information observed. This is 
due to the fact that the optimal retry policy is defined to minimize the total completion 
time of the task affected by the occurrence of fault. Thus, the retry policy is a local 
optimum; "optimal" only for the task involved. Clearly, the retry policy which gives
complete maximum information should have infinite retry durations, although such a
retry policy is totally unacceptable in reality. It would be interesting to examine the
trade-off between the two extreme objectives, i.e., minimizing the local task completion
time and maximizing the information to be collected. This problem can be formulated
as the minimization of the asymptotically accumulated risk,
$\lim_{m \to \infty} \frac{1}{m} \sum_{j=1}^{m} E[\rho_k(x_j,\xi_j)]$, where
$j$ and $m$ are used to number the successive retries and $\xi_j$ is the measured fault characteristic
at the $j$-th retry. This also indicates that the global optimal retry policy should
collect more information (it is definitely not complete though) from the beginning to
speed up the estimation of the true fault characteristics and then implement the local
optimal retry policy once the true characteristics are obtained.

Another important aspect is the choice of an accurate model for the fault behavior.
As was discussed in Sections 4 and 5, the optimal retry policy and the measurement of
the fault characteristics are dependent on the family of density functions that are initially
selected. The suitability of chosen models can be validated through goodness-of-fit
tests, e.g., chi-square goodness-of-fit. Although sometimes the expected task completion
time may not be minimized because of the poor choice of model, the information collected
via retries can still be used to check the suitability of the model. Thus, after a
large number of samples have been obtained, it is possible to select an appropriate form
of density function and then achieve the minimum task completion time. The other
approach is to begin with hypotheses of various forms of density functions. As sampling
progresses, the parameters associated with the density function forms are estimated and
then the hypotheses are tested.
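As a sketch of the chi-square check mentioned above, the following code computes the Pearson statistic of observed active durations against an exponential model. It assumes fully observed (uncensored) durations and a given rate. The bin edges, sample values and function name are hypothetical.

```python
import math

def chi_square_statistic(durations, rate, edges):
    """Pearson chi-square statistic of durations against an exponential model.

    edges are increasing bin boundaries starting at 0; the last bin is
    open-ended, [edges[-1], infinity).
    """
    def cdf(t):
        return 1.0 - math.exp(-rate * t)
    bounds = list(edges) + [math.inf]
    statistic = 0.0
    for low, high in zip(bounds[:-1], bounds[1:]):
        observed = sum(1 for d in durations if low <= d < high)
        upper = 1.0 if math.isinf(high) else cdf(high)
        expected = len(durations) * (upper - cdf(low))
        statistic += (observed - expected) ** 2 / expected
    return statistic

# With rate ln 2 the three bins have probabilities 0.5, 0.25 and 0.25.
durations = [0.1, 0.2, 0.5, 0.9, 1.2, 1.7, 2.5, 4.0]
statistic = chi_square_statistic(durations, rate=math.log(2.0), edges=[0.0, 1.0, 2.0])
```

The statistic is compared against a chi-square critical value whose degrees of freedom equal the number of bins minus one, minus the number of parameters estimated from the same data. Censored samples from failed retries would need separate treatment.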

The work presented in this paper is to incorporate the capability of real-time esti- 
mation (of the fault characteristics) and decision (on optimal retry policies) into the com- 
puter system. The results are a self-adjustable (thus intelligent) system and a powerful 
measurement of the fault characteristics. This idea can also be extended to other appli- 
cations, e.g. the measurement of program behavior and the simultaneous decision of sys- 
tem configuration or scheduling. Such extensions would be significant contributions 
towards the construction of highly intelligent computer systems.

Appendix: The Expression of $E[y_i^l | H_i]$

The retry duration $r_2^l$ under the retry policy $R^l$ is equal to $\infty$ if $x < x_2^*(\mu_l)$ and 0 otherwise.
Thus, the complete information will be gathered if an old intermittent fault is
detected again at $x < x_2^*(\mu_l)$ and no information will be obtained if detected at $x \ge x_2^*(\mu_l)$.
Hence, if the retry for a newly detected intermittent fault when the residual computation
is $x$ succeeds, we expect to collect information from the successive retries before the
task completion as follows:


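$$E[I_2^l | x] = E\Big[\sum_{n=1}^{N_m} z_i(m,n) \,\Big|\, x\Big] = \begin{cases} \nu x \left( \log\frac{\mu_i}{\mu_j} - \frac{\mu_i - \mu_j}{\mu_i} \right) & \text{if } x < x_2^*(\mu_l) \\ \nu\, x_2^*(\mu_l) \left( \log\frac{\mu_i}{\mu_j} - \frac{\mu_i - \mu_j}{\mu_i} \right) & \text{otherwise} \end{cases}$$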


Let the maximum retry duration for a newly detected fault be $r_1^*(x)$ when the residual
computation is $x$. Also, let $\psi(x_d)$ be the density function of the detection time of a
new intermittent fault, $x_d$, given that it is detected during the task execution. Then,
$\psi(x_d) = \frac{\lambda_i e^{-\lambda_i x_d}}{1 - e^{-\lambda_i x_0}}$ where $\lambda_i = p_i \lambda$. Thus, we have $E[y_i^l | H_i]$ as follows:



me' ,,, ilj{x 0 -x){Pi{t)+E[l2\x}}dtdx 


where $I^f(r) = -(\mu_i - \mu_j) r$ is the information collected from an unsuccessful retry of the maximum
retry duration $r$ and $I^s(r) = \log\frac{\mu_i}{\mu_j} - (\mu_i - \mu_j) r$ is the resulting information when the
retry succeeds after the duration $r$.

NF: non-faulty
F: faulty
FB: fault-benign

Figure 1. The Model of Faults.

Figure 2. xl(z,Cj) versus the Shape Parameter $\alpha$ when the Hazard Rate is Increasing.

Figure 3. The Optimal Retry Duration $r_2^*(x,C_f)$ for Weibull Distributions with Decreasing Hazard Rate.